Abstract

Human labeled datasets, along with their corresponding evaluation algorithms, play an important
role in boundary detection. We here present a psychophysical experiment that addresses the
reliability of such benchmarks. To find better remedies to evaluate the performance of any boundary
detection algorithm, we propose a computational framework to remove inappropriate human labels
and estimate the intrinsic properties of boundaries.

1 Introduction

Many problems in human and in computer vision are ill-defined. In problems such as boundary
detection, there is no objective measurement that determines whether there is a perceptually
meaningful boundary at any location in an image. To benchmark the performance of a boundary
detection algorithm, human labeled datasets (e.g. BSDS300 [2], with 200 training images and 100
testing images) play a critical role. These datasets characterize the perceptual definition of
boundaries in an implicit way by providing exemplar images that have been labeled by a small
number of human subjects.

However, labelers do not always agree with each other. Variability is intrinsically related to the
ill-defined nature of boundary detection. Yet there is surprisingly little discussion of data
variability for boundary detection and its effect on benchmarks. It is commonly held that the
labelers of boundary datasets (such as BSDS300) are reliable. Examined separately, each boundary
seems to be reasonable, with some underlying edge in the image. In [2], Martin et al. consider
label variability to be due to different labelers drawing at different levels of detail. They argue
that even though a labeler may scrutinize some parts of the image in considerable detail while
drawing cursory sketches on other parts, different labelers are consistent in the sense that the
dense labels refine the sparse labels without contradicting them. In other words, these different
instances of labels all come from the same perceptual hierarchy of an image.

Nevertheless, local consistency within a specific region is not strong enough to legitimize the
entire benchmark. To be able to faithfully evaluate an algorithm, the benchmark data has to be free
from both type I (false alarm) and type II (miss) statistical errors. Even though the boundaries in
a benchmark dataset seem to be reasonable, it is still possible that the labelers missed some
equally important boundaries, leaving us with an imperfect benchmark. A benchmark that contains
type II errors may incorrectly penalize an algorithm that detects true boundaries.

We here propose a framework to analyze the quality or benchability of any benchmark, and demon- 
strate with a quantitative experiment that the current dataset for benchmarking can be improved. 

2 Evaluating the risk of a boundary benchmark



Although different human labels of the same boundary often contain spatial offsets of up to several
pixels, they rarely contradict each other [2] (e.g. with one drawing a horizontal and the other a
vertical boundary at the same location). Based on this observation, we can merge boundary maps
of the same image labeled by different subjects into one master map $\mathbf{y}$. At each pixel
location $i$, the response of labeler $l$ is a binary value $y_i^l$ (i.e., edge or non-edge), and
$\mathbf{y}_i = [y_i^1, \ldots, y_i^l, \ldots, y_i^L]$ concatenates the responses of all $L$
labelers. We use the assignment algorithm and parameters of [3] to determine whether to merge
adjacent lines from different subjects at one location.

To evaluate the correctness of a benchmark, we used a two-way forced choice paradigm (shown in
Fig. 1). In any one trial, a subject^1 is asked to compare the relative perceptual strength of two
local boundary segments. Similar to [1], we do not give specific instructions that could
potentially bias the result towards one particular type of boundary. The advantage of this
two-alternative experiment is that it cancels out most of the fluctuations of cognitive factors,
such as spatial attention bias, subject fatigue, and decision thresholds that differ between
subjects. Moreover, compared to the tedious labeling process, this paradigm is much simpler and
cheaper to implement via crowd-sourcing.

Given a sufficient number of comparisons and subjects, we can determine the relative perceptual
strength of any pair of boundary segments. This framework yields a strict total ordering on the
set of boundaries. We can map the boundary set onto the interval [0, 1] by assigning each boundary
segment a real value $x$. This value $x$ can be considered as the perceptual strength of the
boundary, because a boundary segment with large $x$ is, by definition, stronger (i.e., chosen more
frequently by subjects) than another boundary with smaller $x$. Let $S$ be the set of all
boundaries in a dataset, $s_i$ be one boundary segment from $S$, and $x_i$ be its perceptual
strength. We can define the risk of a boundary set $S$ in relationship to a boundary set $A$
generated by some reference algorithm as:

$R(S, A) = P(x_i < x_j \mid s_i \in S,\ s_j \in A \setminus S)$. (1)

This paradigm allows us to assess the risk associated with any dataset, such as BSDS300. Because
of its great popularity, we choose pB [3] as the reference algorithm that generates the boundary
set $A$. We choose the pB threshold such that the number of boundaries in $A$ is the same as in
$S$ ($\#A = \#S$). To further illustrate the effect, we restrict the sampling of human labels
$s_i$ to a subset we call orphan labels, which refers to the boundaries that are labeled by only
one labeler ($S_o = \{s_i \mid \sum_{l=1}^{L} y_i^l = 1\}$) and not by the other $L - 1$
labelers^2. 30.88% of the entire boundary set of BSDS300 are orphan labels.
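
As an illustration of the orphan-label definition, the following minimal Python sketch picks out the segments marked by exactly one labeler. The response table `y` and the function name `orphan_indices` are hypothetical and the data are made up; this is not the BSDS300 label format.

```python
def orphan_indices(y):
    """Indices of segments marked by exactly one labeler.

    y is a hypothetical list of rows, one per merged segment; y[i][l] is 1
    if labeler l marked segment i and 0 otherwise.
    """
    return [i for i, row in enumerate(y) if sum(row) == 1]


# Toy responses for 4 segments and 3 labelers (made-up, not BSDS300 data).
y = [
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
]
orphans = orphan_indices(y)              # [1, 3]
orphan_fraction = len(orphans) / len(y)  # 0.5
```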

We used 5 subjects to compare 100 pairs of boundary segments (500 trials in total). For each pair,
we use the mode response of all 5 subjects to determine the ordering. The mean risk of the
orphan-label set is 0.44. That is, almost half of the time, a "false alarm" algorithmic boundary is
perceptually stronger than the orphan label, which would usually be considered "ground truth".
Given the large fraction of orphan labels (almost one third of all boundaries), this leaves the
validity of using BSDS300 to benchmark any one algorithm in doubt.
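
The risk estimate described above can be sketched as follows, assuming each trial records which of the two segments a subject judged stronger. The encoding ("A" for the algorithmic segment, "H" for the human label) and the name `estimate_risk` are assumptions made for illustration, and the responses are invented.

```python
from collections import Counter


def estimate_risk(trials):
    """Fraction of pairs in which the algorithmic segment wins the vote.

    trials is a hypothetical list with one entry per compared pair; each
    entry lists the subjects' choices, "A" when the algorithmic segment was
    judged stronger and "H" when the human-labeled segment was.
    """
    wins = 0
    for votes in trials:
        # The mode of the responses decides the ordering for this pair.
        winner, _ = Counter(votes).most_common(1)[0]
        wins += winner == "A"
    return wins / len(trials)


# Made-up responses from 5 subjects on 4 pairs (illustration only).
trials = [
    ["A", "A", "H", "A", "H"],
    ["H", "H", "H", "A", "H"],
    ["A", "H", "A", "A", "A"],
    ["H", "H", "A", "H", "H"],
]
risk = estimate_risk(trials)  # 0.5
```

With an odd number of subjects and two alternatives the mode is always unique, so no tie-breaking rule is needed in this sketch.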

Given a threshold $\tau$, there exists a perfect boundary set $S_\tau$ that has zero risk, such
that $x_i > \tau$ for any $s_i \in S_\tau$, and $x_j < \tau$ for any $s_j \notin S_\tau$. This
perfect set can be formed by examining the boundary strength of all possible boundaries from all
images. However, the current imperfect boundary set $S$, annotated by a finite number of unreliable
labelers, lacks information about the vast majority of unlabeled pixels. There is some probability
that a "qualified" boundary $s_i$ with $x_i > \tau$ exists in the unlabeled pixels. This
probability decreases as $\tau$ increases, because a relatively strong boundary is less likely to
be overlooked by all labelers. In fact, by taking the extremal threshold $\tau > 1$, we end up with
a trivial solution: a risk-free but useless empty boundary set^3.

In this paper, we restrict our analysis to the existing boundary labels in BSDS300, and try to
infer the perceptual strength of each boundary segment. Inferred perceptual strengths allow the user

^1 We refer to labelers as the people who originally labeled the BSDS300 dataset, while subjects
refers to the people we recruited to perform our two-way forced choice experiment.

^2 Sampling human labels from the orphan-label set while sampling algorithm labels from
$A \setminus S$ makes the procedure slightly different from the original Eq. (1).

^3 The other trivial solution is the original set, obtained by setting $\tau$ to 0.

Figure 1: An illustration of our two-way forced choice experiment. The left figure shows the Venn
diagram of boundary subsets. The thick circle encompasses the full boundary set $S$. Within $S$,
the set of orphan labels is shown in green. The pB boundary set $A$ is the dotted ellipse. The set
of edges falsely identified by the algorithm, $A \setminus S$, is highlighted in red. In each
trial, we randomly select one boundary segment from the orphan-label set (green ring) and another
from $A \setminus S$ (red ellipse) and ask subjects to judge which one is perceptually stronger.
Two boundary segments (high contrast squares with red lines) are superimposed onto the original
image (shown in the middle figure). At the same time, the original is also presented to the subject
in a separate window. In total, 100 image pairs are compared by all 5 subjects. The right figure
shows the risk of this database (that is, how often the false-alarm algorithmic edges are preferred
over the human labels) for all 5 subjects. The dotted line is chance level (0.5).



to choose an appropriate threshold and form a subset of boundary segments that balances risk and
utility, where utility refers to the total number of data points available in the selected subset.
In the next section, we present a graphical model that estimates the perceptual strength of each
boundary.
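
To make the utility side of this trade-off concrete, here is a small Python sketch that forms a thresholded subset and counts its size. The strengths are invented, the function name `threshold_subset` is hypothetical, and the strict inequality follows the zero-risk condition stated earlier in this section.

```python
def threshold_subset(strengths, tau):
    """Indices of segments whose inferred strength exceeds the threshold tau."""
    return [i for i, x in enumerate(strengths) if x > tau]


# Made-up perceptual strengths for five segments (illustration only).
strengths = [0.05, 0.30, 0.55, 0.85, 0.95]

# Utility is the number of segments that survive each threshold.
utilities = {tau: len(threshold_subset(strengths, tau)) for tau in (0.2, 0.5, 0.8)}
# utilities == {0.2: 4, 0.5: 3, 0.8: 2}
```

Raising the threshold can only shrink the subset, which is why a lower risk has to be paid for with fewer usable data points.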

3 Model and inference 

During the labeling process, each labeler $l$, governed by her/his internal psychophysical
parameters $\theta^l$, responds to segments of different perceptual strength $x_i$. For all the
boundaries such that $\{i \mid x_i = x\}$, the response follows a mixture of Bernoulli
distributions with parameter $\mu^l(x)$. Furthermore, we assume that $\mu^l(x)$ has a sigmoid
functional form. The graphical model of the labeling process is shown in Fig. 2.

Figure 2: The graphical model of the labeling process. This model assumes that the label is
determined probabilistically by the perceptual strength $x_i$ and the response profile of the
labeler $\mu^l$, which is further controlled by a hidden parameter $\theta^l$. The gray circle
indicates the observed variable, which is the binary individual response to a boundary segment. The
model outputs estimates of the perceptual strength of each boundary segment as well as the
parameters of each labeler.

In our model, $x_i$ follows a uniform distribution $U(0, 1)$, and $\mu^l(x) = s(x, \theta^l)$,
where $s(\cdot)$ is a sigmoid function of $x$ parameterized by $\theta^l$. The conditional
probability of $y_i^l$ is a soft voting of different $\mu^l$, such that
$P(y_i^l = 1 \mid \mu^l, x_i) = \int \phi_\sigma(x_i - x)\, \mu^l(x)\, dx$, where
$\phi_\sigma(\cdot)$ is the Gaussian probability density function with zero mean and standard
deviation $\sigma$. We set $\sigma = 0.15$.
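
The soft-voting probability can be approximated numerically, as in the Python sketch below. Two things here are assumptions rather than the paper's specification: the sigmoid is taken to be a two-parameter logistic (slope and offset), and the integral is evaluated over [0, 1] with a midpoint rule. All function and parameter names are hypothetical.

```python
import math

SIGMA = 0.15  # kernel width stated in the text


def sigmoid(x, slope, offset):
    # Assumed two-parameter logistic; the paper's exact expression may differ.
    return 1.0 / (1.0 + math.exp(-slope * (x - offset)))


def gaussian_pdf(d, sigma=SIGMA):
    # Zero-mean Gaussian density evaluated at offset d.
    return math.exp(-0.5 * (d / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def label_probability(x_i, slope, offset, steps=1000):
    """Midpoint-rule estimate of the integral of phi(x_i - x) * mu(x) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += gaussian_pdf(x_i - x) * sigmoid(x, slope, offset)
    return total * h


# A weak and a strong segment seen by the same hypothetical labeler.
p_weak = label_probability(0.2, slope=10.0, offset=0.5)
p_strong = label_probability(0.8, slope=10.0, offset=0.5)
```

Under these assumptions a stronger segment receives a higher labeling probability, and the Gaussian kernel blurs the labeler's response profile around the segment's strength.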

We use the EM algorithm to estimate $\theta^l$, $\mu^l(x)$, and $x_i$. We start with
$E_l[y_i^l]$ as the initial guess $x_i^*$. In each iteration, the estimate of $\mu^l$ is given by
$\mu^l(x)^* = \sum_i y_i^l\, \phi_\sigma(x_i^* - x)$, and $\theta^l$ is updated by
$\theta^{l*} = \arg\min_\theta \int (s(x, \theta) - \mu^l(x))^2\, dx$. For the estimate of $x_i$,
we have $x_i^* = \arg\max_{x_i} P(\mathbf{y}_i \mid x_i)$. The optimization process converges
within 20 iterations. The distribution of the perceptual strength is shown in Fig. 3.
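
The first two steps of this procedure, the initial guess and the kernel estimate of a labeler's response profile, can be sketched in Python as follows. The estimate is left unnormalized, which is one reading of the update rule; the response table and function names are hypothetical, and the sigmoid fit and the strength update are omitted.

```python
import math

SIGMA = 0.15  # kernel width stated in the text


def gaussian_pdf(d, sigma=SIGMA):
    # Zero-mean Gaussian density evaluated at offset d.
    return math.exp(-0.5 * (d / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def initial_strengths(y):
    """Initial guess: the fraction of labelers who marked each segment."""
    return [sum(row) / len(row) for row in y]


def response_profile(y, x_est, labeler, x):
    """Unnormalized kernel estimate: sum over i of y[i][labeler] * phi(x_est[i] - x)."""
    return sum(row[labeler] * gaussian_pdf(xi - x) for row, xi in zip(y, x_est))


# Toy responses for 4 segments and 3 labelers (made-up data).
y = [
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
]
x_est = initial_strengths(y)
high = response_profile(y, x_est, labeler=0, x=0.9)
low = response_profile(y, x_est, labeler=0, x=0.1)
```

With three labelers the initial guess can only take the values 1/3, 2/3 and 1, so it is concentrated on a few discrete levels before any refinement.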




4 Experimental validation 



Given the inferred perceptual strengths, we select 4 thresholds, $\tau_1 = 0.2$, $\tau_2 = 0.5$,
$\tau_3 = 0.8$, and $\tau_4 = 1$, and form 4 subsets $S_{\tau_k}$ of boundary segments. For each
$S_{\tau_k}$ we use the pB algorithm to generate $A_{\tau_k}$ such that
$\#A_{\tau_k} = \#S_{\tau_k}$. Finally, a 5-subject experiment is conducted to evaluate the risk
of $S_{\tau_k}$. For each image, we randomly choose a pair of boundary segments, one from
$S_{\tau_k}$ and one from $A_{\tau_k}$, and then take the majority vote of our subjects' responses
to estimate the relative strength ordering. A total of 500 trials are averaged to estimate the risk
of each subset. The result is shown in Fig. 3.

Figure 3: Left 1: Initial guess of the perceptual strength distribution. Left 2: Final estimate of
the perceptual strength distribution. Right 1: Individual estimates of the risk of each thresholded
subset. In this figure, each color corresponds to one subject. Right 2: Risk estimate based on
majority voting of all subjects. The dotted line in the right figure indicates the mode risk of the
orphan-label set in Fig. 1.



5 Discussion and future works 

There are two main trends in the perceptual strength distributions shown in Fig. 3. First, the
spiky distribution of the initial guess has been successfully smoothed out, because each labeler
has distinctive labeling characteristics, and their responses are therefore weighted differently in
the estimated strength. Second, many of the boundary strengths are automatically suppressed to
zero. In fact, most of these zero-strength boundary segments correspond to the orphan labels, which
are the biggest source of the dataset risk. From the right two figures, we see that the subset risk
decreases as the perceptual strength threshold $\tau$ goes up. This result supports the
risk-utility model we mentioned in Sec. 2.

We have shown that a human-labeled dataset, even if well constructed and tested, can contain serious
risks that hinder its ability to evaluate algorithm performance. We first proposed a psychophysical
test to estimate human dataset risk, where by risk we mean mistakenly classifying strong algorithmic
boundaries as false alarms. We then discussed an inference model to find the perceptual strength of
each boundary segment, and used it to balance the risk-utility trade-off.

Due to space limitations, we are unable to discuss other factors, such as the stability of
labeler-image assignment and its influence on the perceptual strength estimation, the
information-theoretic limit of the two-way forced choice, and the variation of results when using
different algorithms. These issues will be addressed in the journal submission of this paper [1].